I am feeling curious, excited and a bit nervous as well. I expect to learn about programming and data analysis in the R language; so far I have been using SPSS. I also want to learn how to analyse big data and find patterns in data that could lead to new findings. I found the course through UEF WebOodi.
Here is the link to my GitHub repository: https://github.com/saranglq/IODS-project
Here is the link to my diary page https://saranglq.github.io/IODS-project/
Describe the work you have done this week and summarize your learning.
I started this week by learning R through the DataCamp platform. I learned how to handle (wrangle) data in R: how to modify existing datasets, merge different datasets, and carry out simple and multiple linear regression analyses. I also learned how to create graphical representations of these analyses.
The dataset given to us was from the “International survey of Approaches to Learning”, funded by Teachers’ Academy funding awarded to Kimmo Vehkalahti. Data were collected during 2013-2015.
The original dataset had 183 observations of 60 variables. Variables with prefixes A and C were based on components of parts A and C of ASSIST (Approaches and Study Skills Inventory for Students). Variables with prefix D were based on SATS (Survey of Attitudes Toward Statistics). Items from part B of ASSIST were named so that their connection to the relevant dimension (Deep/Surface/Strategic) was maintained.
The dataset had many variables, and my task was to merge the ASSIST part B components into three combination variables (Deep, Surface, Strategy). I also renamed three other variables (Age, Attitude, Points -> age, attitude, points) and, after selecting these 7 analysis variables, excluded the rest. I saved the resulting dataset as lrn2014.csv, which has 7 variables and 166 observations.
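The wrangling can be sketched roughly as below. The raw item names (D03, SU02, ...) and the exclusion of zero-point rows are assumptions for illustration only; the actual wrangling script is not reproduced here.

```r
# Toy stand-in for the raw survey data (the real file has 183 rows, 60 columns)
raw <- data.frame(
  gender = c("F", "M", "F"),
  Age = c(21, 25, 30), Attitude = c(30, 35, 20), Points = c(22, 28, 19),
  D03 = c(4, 5, 3), D11 = c(3, 4, 4),    # hypothetical Deep items
  SU02 = c(2, 3, 2), SU08 = c(2, 2, 3),  # hypothetical Surface items
  ST01 = c(3, 4, 3), ST09 = c(4, 4, 2)   # hypothetical Strategy items
)

# Average each item group into one combination variable
raw$deep <- rowMeans(raw[, c("D03", "D11")])
raw$surf <- rowMeans(raw[, c("SU02", "SU08")])
raw$stra <- rowMeans(raw[, c("ST01", "ST09")])

# Rename and keep only the seven analysis variables
lrn2014 <- data.frame(gender = raw$gender, age = raw$Age,
                      attitude = raw$Attitude, deep = raw$deep,
                      stra = raw$stra, surf = raw$surf, points = raw$Points)

# The drop from 183 to 166 observations suggests rows with zero exam
# points were excluded (an assumption in this sketch)
lrn2014 <- lrn2014[lrn2014$points > 0, ]

write.csv(lrn2014, "lrn2014.csv", row.names = FALSE)
```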
The gender distribution was uneven: there were 110 female and 56 male subjects.
Distributions of each variable (except gender) are given below:
Following is a summary table for all variables:
| Variable | Mean | Median | Min | Max |
|---|---|---|---|---|
| Age | 25.51 | 22 | 17 | 55 |
| Attitude | 31.43 | 32 | 14 | 50 |
| Deep | 3.68 | 3.18 | 1.58 | 4.91 |
| Strategy | 3.12 | 3.18 | 1.25 | 5.00 |
| Surface | 2.78 | 2.83 | 1.58 | 4.33 |
| Points | 22.72 | 23.00 | 7.00 | 33.00 |
I first carried out an exploratory analysis to find out which variables correlate best with points:
Attitude, strategy and surface had the strongest correlations with points and were therefore chosen as the predictor variables.
I ran a multiple linear regression model and obtained the following results.
According to the results, attitude is the only variable that has a significant linear relationship with points (p < 0.001). Each 1-point increase in attitude results in an increase of 0.34 in points.
The non-significant variables were removed and the model was fitted again.
According to the final model, attitude still has a significant linear relationship with points; each 1-point increase in attitude results in an increase of 0.35 in points. The multiple R-squared value is 0.19, which means that attitude explains 19 percent of the variance in points.
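The two modelling steps can be sketched as follows. Since the course CSV is not rebuilt in this sketch, a synthetic stand-in for lrn2014 is generated with an attitude effect of 0.35 built in; the column names (attitude, stra, surf, points) follow the wrangled dataset.

```r
set.seed(2014)
# Synthetic stand-in for lrn2014 (166 rows, true attitude slope = 0.35)
lrn2014 <- data.frame(attitude = runif(166, 14, 50),
                      stra = runif(166, 1, 5),
                      surf = runif(166, 1, 5))
lrn2014$points <- 11 + 0.35 * lrn2014$attitude + rnorm(166, sd = 3)

# Full model with the three strongest correlates of points
model <- lm(points ~ attitude + stra + surf, data = lrn2014)
summary(model)

# Reduced model after dropping the non-significant predictors
model2 <- lm(points ~ attitude, data = lrn2014)
summary(model2)
```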
The assumptions of the multiple linear model are that:

- errors are normally distributed
- errors are not correlated
- errors have constant variance
- the size of a given error does not depend on the explanatory variables

Diagnostic plots were created to check these assumptions:
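The three diagnostic plots can be produced with base R's plot method for lm objects; a built-in dataset stands in for the fitted model here.

```r
# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
m <- lm(mpg ~ wt, data = mtcars)   # stand-in model for illustration
par(mfrow = c(1, 3))
plot(m, which = c(1, 2, 5))
```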
The Q-Q plot demonstrates that the errors of the model are normally distributed; there are, however, values at the extremities that deviate from the normal line.
The scatterplot of residuals vs fitted values shows no pattern, i.e. the spread remains largely similar as the fitted value increases. However, some residuals lie far from the fitted line; these could potentially be outliers.
The residuals vs leverage plot shows that none of the observations have an unusually high impact on the results.
The current dataset is called the “Student Performance Data Set” and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Student+Performance. The data compare student performance in two subjects: Portuguese and Mathematics. G1 refers to grades received during the first period, and similarly G2 and G3 correspond to grades received in the second and third periods.
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
After data wrangling the dataset was saved as alc.csv, which contains 35 variables and 382 observations.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [1] 382 35
I would be interested in examining the relationship between alcohol use and:
Parental cohabitation: I think that tense family relationships can lead to higher stress in teenagers and therefore a tendency towards higher alcohol consumption.
Mother’s education: Research shows that higher education for women results in multiple positive outcomes for family and society. This could also be reflected in the attitudes of children.
Romantic relationships: Failure to bond during the teenage years and peer pressure to be in a relationship can result in a higher tendency to drink.
Current health status: It might be a causal factor for drinking, but also vice versa, i.e. excessive drinking could be related to poor health.
Parental cohabitation and romantic relationship are binomial variables. Mother’s education and health are categorical, ordinal variables. Bar charts showing their distributions are presented below.
The charts show the distribution of subjects based on their relationship status, parental cohabitation and alcohol use.
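Count tables like the ones below can be produced with dplyr; the toy data frame here is a hypothetical stand-in for alc.

```r
library(dplyr)

# Hypothetical stand-in for the alc data
toy <- data.frame(Pstatus = factor(c("A", "A", "T", "T", "T")),
                  high_use = c(FALSE, TRUE, FALSE, FALSE, TRUE))

# Count subjects per parental-cohabitation status and alcohol-use class
counts <- toy %>%
  group_by(Pstatus, high_use) %>%
  summarise(count = n(), .groups = "drop")
counts
```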
## # A tibble: 4 x 3
## # Groups: Pstatus [2]
## Pstatus high_use count
## <fct> <lgl> <int>
## 1 A FALSE 26
## 2 A TRUE 12
## 3 T FALSE 242
## 4 T TRUE 102
## # A tibble: 4 x 3
## # Groups: romantic [2]
## romantic high_use count
## <fct> <lgl> <int>
## 1 no FALSE 180
## 2 no TRUE 81
## 3 yes FALSE 88
## 4 yes TRUE 33
None of the chosen variables are significantly associated with high alcohol use. The odds ratios for high alcohol use, along with their confidence intervals, are presented for each variable below.
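The model and its odds ratios can be obtained along these lines; mtcars is used as a stand-in since alc is not rebuilt in this sketch. The profile-likelihood confidence intervals are what print the "Waiting for profiling to be done..." message.

```r
library(MASS)   # profile-likelihood confint method for glm

# Logistic model on a stand-in dataset (am plays the role of high_use)
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

OR <- exp(coef(m))       # odds ratios
CI <- exp(confint(m))    # profile-likelihood 95% CIs
cbind(OR, CI)
```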
##
## Call:
## glm(formula = high_use ~ Pstatus + Medu + romantic + health,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4627 -0.8821 -0.7091 1.2911 2.0228
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10674 1.32994 0.080 0.936
## PstatusT -0.08875 0.38587 -0.230 0.818
## Medu1 -0.91856 1.27172 -0.722 0.470
## Medu2 -1.90163 1.26062 -1.508 0.131
## Medu3 -0.91270 1.25197 -0.729 0.466
## Medu4 -1.27811 1.24739 -1.025 0.306
## romanticyes -0.15946 0.25057 -0.636 0.525
## health2 0.76463 0.48160 1.588 0.112
## health3 0.13564 0.44268 0.306 0.759
## health4 0.32219 0.45049 0.715 0.474
## health5 0.63142 0.39548 1.597 0.110
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 448.12 on 371 degrees of freedom
## AIC: 470.12
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.1126471 0.086924876 27.053858
## PstatusT 0.9150754 0.437244590 2.007650
## Medu1 0.3990923 0.017558764 4.548870
## Medu2 0.1493248 0.006654147 1.665484
## Medu3 0.4014386 0.018078594 4.410220
## Medu4 0.2785625 0.012611116 3.033450
## romanticyes 0.8526065 0.517603874 1.385456
## health2 2.1482059 0.843454699 5.637405
## health3 1.1452705 0.486362057 2.792529
## health4 1.3801401 0.576722981 3.413775
## health5 1.8802695 0.888022503 4.233663
I chose study time, activities and going out with friends. Distributions of these variables are presented through bar charts.
Based on the distributions I chose to explore gender, study time, activities and going out. Gender and going out were significantly associated with high alcohol use. Study time also had a significant but weak association.
Males were more likely to drink heavily (OR 2.24, 95% CI 1.32 - 3.85). Compared to the reference group, students who went out most frequently were more likely to consume alcohol heavily (OR 10.74, 95% CI 3.04 - 51.42).
##
## Call:
## glm(formula = high_use ~ sex + studytime + activities + goout,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7617 -0.7725 -0.5228 0.8228 2.5617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8672 0.6722 -2.778 0.005474 **
## sexM 0.8073 0.2721 2.966 0.003012 **
## studytime2 -0.3456 0.2970 -1.164 0.244524
## studytime3 -0.9712 0.4809 -2.019 0.043455 *
## studytime4 -1.1241 0.6298 -1.785 0.074310 .
## activitiesyes -0.4044 0.2590 -1.561 0.118454
## goout2 0.3503 0.6956 0.504 0.614510
## goout3 0.5243 0.6813 0.770 0.441507
## goout4 2.0665 0.6816 3.032 0.002430 **
## goout5 2.3735 0.7020 3.381 0.000722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 384.35 on 372 degrees of freedom
## AIC: 404.35
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.1545556 0.03368531 0.5095379
## sexM 2.2418487 1.32119716 3.8496160
## studytime2 0.7077842 0.39492190 1.2688956
## studytime3 0.3786406 0.14036906 0.9407166
## studytime4 0.3249548 0.08308952 1.0313723
## activitiesyes 0.6673741 0.39987596 1.1063789
## goout2 1.4195109 0.40277427 6.7023293
## goout3 1.6893594 0.49743864 7.8225116
## goout4 7.8972057 2.33914417 36.6844868
## goout5 10.7354006 3.03673066 51.4240176
The previously significant variables (sex, study time, going out) were chosen for this model. The table below shows, for high users and non-high users, the proportions of subjects classified correctly and incorrectly by the model.
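A proportion table like the one below can be built from the fitted probabilities with base R; the toy vectors here are hypothetical stand-ins for the alc columns.

```r
# Hypothetical actual classes and fitted probabilities
high_use    <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
probability <- c(0.2, 0.1, 0.6, 0.9, 0.3)

# Classify as a high user when the predicted probability exceeds 0.5
prediction <- probability > 0.5

# 2x2 table of proportions with row/column margins
addmargins(prop.table(table(high_use = high_use, prediction = prediction)))
```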
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65183246 0.04973822 0.70157068
## TRUE 0.15445026 0.14397906 0.29842932
## Sum 0.80628272 0.19371728 1.00000000
The loss function value for the training model was:
## [1] 0.2041885
If a simple guessing strategy were employed, there would be a 50 percent chance of getting a correct answer, since the outcome is binomial. The model, however, is more effective, as it gives a correct result about 80 percent of the time on the training set.
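The loss used here is the mean misclassification rate, in the spirit of the DataCamp exercises; a minimal sketch:

```r
# Mean of incorrect classifications: a prediction counts as wrong when the
# predicted probability lands on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Toy check: one prediction in four is on the wrong side of 0.5
loss_func(class = c(0, 0, 1, 1), prob = c(0.2, 0.6, 0.9, 0.8))
```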
## [1] 0.2303665
The 10-fold cross-validation resulted in an error of about 0.23 for my model, which is less than the 0.26 of the model introduced in DataCamp.
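The cross-validation can be run with boot::cv.glm; mtcars again stands in for alc in this sketch.

```r
library(boot)

# Misclassification loss used as the CV cost function
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Stand-in logistic model (am plays the role of high_use)
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

set.seed(123)
cv <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]   # average misclassification error over the 10 folds
```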
Let us fit all the variables into a logistic regression model.
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
## [36] "probability" "prediction"
##
## Call:
## glm(formula = high_use ~ school + sex + studytime + goout + Pstatus +
## Medu + Fedu + Mjob + Fjob + reason + nursery + internet +
## guardian + traveltime + studytime + failures + schoolsup +
## famsup + paid + activities + higher + romantic + famrel +
## freetime + goout + health + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9372 -0.6024 -0.2891 0.3696 2.9525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.00788 998.11272 -0.016 0.987204
## schoolMS -0.31881 0.56801 -0.561 0.574610
## sexM 1.14045 0.36391 3.134 0.001725 **
## studytime2 -0.10816 0.39602 -0.273 0.784764
## studytime3 -0.58458 0.62948 -0.929 0.353053
## studytime4 -0.51937 0.81835 -0.635 0.525648
## goout2 0.35067 0.92424 0.379 0.704381
## goout3 0.39244 0.89417 0.439 0.660744
## goout4 2.48140 0.90743 2.735 0.006247 **
## goout5 2.57466 0.94753 2.717 0.006583 **
## PstatusT -0.64279 0.57431 -1.119 0.263036
## Medu1 -1.46770 1.63389 -0.898 0.369033
## Medu2 -2.29763 1.63061 -1.409 0.158818
## Medu3 -0.97329 1.65010 -0.590 0.555302
## Medu4 -1.11258 1.71722 -0.648 0.517053
## Fedu1 13.12642 998.10935 0.013 0.989507
## Fedu2 13.35623 998.10933 0.013 0.989323
## Fedu3 13.14915 998.10933 0.013 0.989489
## Fedu4 13.46401 998.10938 0.013 0.989237
## Mjobhealth -1.49344 0.86167 -1.733 0.083061 .
## Mjobother -0.81758 0.57177 -1.430 0.152744
## Mjobservices -1.17488 0.64412 -1.824 0.068150 .
## Mjobteacher -0.54941 0.80208 -0.685 0.493354
## Fjobhealth 0.51742 1.31546 0.393 0.694071
## Fjobother 1.19438 0.99984 1.195 0.232255
## Fjobservices 1.97809 1.02534 1.929 0.053705 .
## Fjobteacher -0.58137 1.19604 -0.486 0.626915
## reasonhome 0.58678 0.43925 1.336 0.181590
## reasonother 1.59275 0.60414 2.636 0.008379 **
## reasonreputation 0.12521 0.45818 0.273 0.784633
## nurseryyes -0.38504 0.41007 -0.939 0.347750
## internetyes -0.18085 0.47879 -0.378 0.705634
## guardianmother -0.65645 0.39486 -1.662 0.096413 .
## guardianother -0.70067 0.94726 -0.740 0.459495
## traveltime2 -0.30249 0.38375 -0.788 0.430558
## traveltime3 1.91997 0.80682 2.380 0.017328 *
## traveltime4 3.26053 1.12585 2.896 0.003779 **
## failures 0.17111 0.29453 0.581 0.561252
## schoolsupyes -0.35779 0.50971 -0.702 0.482709
## famsupyes -0.08541 0.35824 -0.238 0.811558
## paidyes 0.67650 0.35640 1.898 0.057673 .
## activitiesyes -0.35070 0.34488 -1.017 0.309210
## higheryes -0.11162 0.78940 -0.141 0.887556
## romanticyes -0.45218 0.36731 -1.231 0.218300
## famrel2 0.67295 1.40011 0.481 0.630773
## famrel3 0.63499 1.27185 0.499 0.617593
## famrel4 -0.54507 1.24670 -0.437 0.661956
## famrel5 -1.36245 1.26822 -1.074 0.282688
## freetime2 1.02423 1.12818 0.908 0.363952
## freetime3 1.57764 1.09818 1.437 0.150831
## freetime4 2.01108 1.12377 1.790 0.073521 .
## freetime5 1.50294 1.19980 1.253 0.210329
## health2 1.09810 0.72431 1.516 0.129504
## health3 0.50881 0.68053 0.748 0.454660
## health4 0.74525 0.68633 1.086 0.277547
## health5 0.97885 0.61138 1.601 0.109368
## absences 0.10369 0.03037 3.415 0.000639 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 292.37 on 325 degrees of freedom
## AIC: 406.37
##
## Number of Fisher Scoring iterations: 14
## [1] 0.1910995
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] 0.2539267
The training error of this model was 0.19 and the testing error was 0.25.
The significant variables were selected and another model was created.
##
## Call:
## glm(formula = high_use ~ sex + goout + reason + traveltime +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7762 -0.7111 -0.4500 0.6284 2.5265
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.31613 0.76071 -4.359 1.31e-05 ***
## sexM 0.97063 0.27099 3.582 0.000341 ***
## goout2 0.49127 0.74900 0.656 0.511890
## goout3 0.54824 0.73191 0.749 0.453824
## goout4 2.23550 0.73810 3.029 0.002456 **
## goout5 2.35217 0.75767 3.104 0.001906 **
## reasonhome 0.36755 0.33529 1.096 0.272983
## reasonother 1.15017 0.46511 2.473 0.013403 *
## reasonreputation -0.16018 0.35498 -0.451 0.651819
## traveltime2 -0.11500 0.30300 -0.380 0.704275
## traveltime3 1.47622 0.57538 2.566 0.010298 *
## traveltime4 1.87986 0.92470 2.033 0.042057 *
## absences 0.08835 0.02377 3.717 0.000202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 360.59 on 369 degrees of freedom
## AIC: 386.59
##
## Number of Fisher Scoring iterations: 4
## [1] 0.1884817
## [1] 0.2068063
This model performed better on the testing set, i.e. the testing error was 0.21. Finally, I decided to remove the reason (for choosing the school) and travel time variables, because they were weak predictors and the “other” category of the reason variable was hard to interpret.
##
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7601 -0.7278 -0.4717 0.7663 2.2870
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.89554 0.71197 -4.067 4.76e-05 ***
## sexM 1.02291 0.26103 3.919 8.90e-05 ***
## goout2 0.35631 0.73254 0.486 0.626685
## goout3 0.42736 0.71684 0.596 0.551053
## goout4 1.99158 0.71679 2.778 0.005461 **
## goout5 2.25925 0.73593 3.070 0.002141 **
## absences 0.08396 0.02278 3.685 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 380.07 on 375 degrees of freedom
## AIC: 394.07
##
## Number of Fisher Scoring iterations: 4
## [1] 0.2068063
## [1] 0.2146597
The final model had 3 predictors and a testing-set loss of 0.21.
## n_variables training_error
## 1 27 0.1910995
## 2 5 0.1884817
## 3 3 0.2068063
## n_variables testing_error
## 1 27 0.2539267
## 2 5 0.2068063
## 3 3 0.2146597
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(boot)
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
Now we shall load the Boston dataset from the MASS package and explore it.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
dim(Boston)
## [1] 506 14
The data has 14 variables and 506 observations; each observation corresponds to a suburb of Boston, i.e. there are 506 suburbs in total in this dataset. The variables are described in detail at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Data contains 12 numeric variables and 2 integer variables.
The variables are labelled as follows:
varlabels <- c("per capita crime rate", "proportion of residential land zoned", "proportion of industrial zones per town", "town borders Charles River (N/Y)", "NO concentration (parts per 10M)", "average number of rooms per dwelling", "proportion of units built pre-1940", "weighted mean of distances to 5 Boston employment centres", "index of accessibility to radial highways", "property tax rate per $10,000", "pupil to teacher ratio", "proportion of blacks by town", "lower status of population", "median value of homes in $1000s")
selcols <- colnames(Boston)
selvars <- dplyr::select(Boston, one_of(selcols))
a=0
for(var in selvars) {
if(is.integer(var)){
a = a +1
plot <-ggplot(Boston, aes(var, ..count..)) + geom_bar() + xlab(varlabels[a]) + ylab("Number of suburbs")
print(plot)
} else {
a = a +1
plot <- qplot(var,
geom="histogram",
binwidth = ((1/7.9) * sd(var)) * 3.49,
main = paste("Histogram for", varlabels[a]) ,
xlab = (varlabels[a]),
fill=I("blue"),
col=I("red"),
alpha=I(.2))
print(plot)
}
}
cor_matrix<-cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation matrix shows that the strongest positive correlations are between (1) accessibility to radial highways and the property tax rate, (2) industrialization and NO concentration, and (3) the number of rooms per dwelling and the median value of homes. The strongest negative correlations are between (1) the proportion of buildings built prior to 1940 and the distance to employment centres, (2) NO concentration and the distance to employment centres, (3) the proportion of industrial zones and the distance to employment centres, and (4) the percentage of lower-status population and the median value of owner-occupied homes.
Many patterns can already be drawn from this data. Accessibility to highways implies higher development and could mean higher property tax rates. More industrial zones can explain higher NO concentrations. The number of rooms per dwelling could explain the median value of homes, as larger homes are more expensive. The negative relationship between the percentage of lower-status population and the median value of homes shows that home values are higher in suburbs with wealthier residents.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = label)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
After standardizing the data, the mean of each variable is centred at zero and the ranges have shrunk, while the proportions between values are maintained.
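scale() standardizes each column by subtracting its mean and dividing by its standard deviation; a minimal check by hand:

```r
x <- c(2, 4, 6)
manual <- (x - mean(x)) / sd(x)   # column-wise standardization by hand
same <- all.equal(manual, as.numeric(scale(x)))
manual
```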
Now we shall split the data into train and test set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
dim(train)
## [1] 404 14
dim(test)
## [1] 102 14
Drawing the LDA bi-plot. Categorical crime rate variable was used as the target and all other variables were used as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 13 18 1 0 32
## med_low 0 21 5 0 26
## med_high 1 4 14 1 20
## high 0 0 0 24 24
## Sum 14 43 20 25 102
When testing the trained model on the test dataset, the model predicted correctly 13 out of 32 low-crime, 21 out of 26 med_low, 14 out of 20 med_high and 24 out of 24 high-crime suburbs.
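The overall accuracy can be read off the confusion matrix above:

```r
# The cross-tabulation above: rows = correct class, columns = predicted class
conf <- matrix(c(13, 18,  1,  0,
                  0, 21,  5,  0,
                  1,  4, 14,  1,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf)) / sum(conf)   # overall accuracy, about 0.71
```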
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <-kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[6:10], col = km$cluster)
Checking the optimum number of clusters and re-analyzing
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled, centers = 4)
pairs(boston_scaled[6:10], col = km$cluster)
I could not find a clear drop in WCSS after two clusters, so I set the number of clusters to 3.
If we draw a pairs plot, we can see that the neighborhoods cannot be separated based on two variables alone. Some variable pairs separate 2 clusters well, but almost no combination of two variables separates three clusters cleanly.
So why don’t we try separating the clusters with LDA instead of a pairs plot? I chose 4 clusters for this purpose because it gives a good separation of the data.
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
km <- kmeans(boston_scaled, centers = 4)
lda.fit <- lda(km$cluster~ ., data = boston_scaled)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 7)
According to the results from the LDA plot, the proportion of black people in a town, nitric oxide concentration, industrialization, tax rate and crime rate were the biggest factors separating the suburbs into four categories.
A higher crime rate placed suburbs in cluster one. A lower (or higher? this is unclear from the variable description and formula) proportion of black people pulled towns in the opposite direction. NO concentration, industrialization and tax rate pulled suburbs towards the left, i.e. clusters 1 and 2. It can also be seen that these zones were more accessible to radial highways.
We predicted the crime rate category using the other variables and plotted it in 3D!
The points were coloured in the first graph according to the crime rate category and in the second graph according to k-means clustering (the data were clustered into 4 groups to match the four crime rate categories).
Even though in the second 3D graph the data were clustered using all available variables, not specifically to predict crime, note that it still does a fairly good job of separating the suburbs into four groups according to crime rate.
#PLOTTING 3D GRAPH FOR TRAIN SET WITH CRIME
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
km <- kmeans(train, centers = 4)
bins <- quantile(train$crim)
label <- c("low", "med_low", "med_high", "high")
crime <- cut(train$crim, breaks = bins, include.lowest = TRUE, label = label)
train <- data.frame(train, crime)
lda.fit <- lda(crime ~ ., data = train)
classes <- as.numeric(train$crime)
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)